DETAILED ACTION

The following new grounds of rejection are in response to a decision to re-open 
prosecution pursuant to the Request for a Pre-Appeal Conference filed by Applicant on 
21 December 2007. It is maintained that Thambiratnam et al. ("Speech Recognition in 
Adverse Environments using Lip Information") provides a somewhat better disclosure 
for purposes of appeal for the features of "detecting if the audio signals can be 
processed" and "processing the video signals based on a detection that at least a 
portion of the audio signal cannot be processed". The finality of the Office Action is 
withdrawn, and the rejection is NON-FINAL. 

Claim Rejections - 35 USC § 103 

1. The following is a quotation of 35 U.S.C. 103(a) which forms the basis for all 
obviousness rejections set forth in this Office action: 

(a) A patent may not be obtained though the invention is not identically disclosed or described as set 
forth in section 102 of this title, if the differences between the subject matter sought to be patented and 
the prior art are such that the subject matter as a whole would have been obvious at the time the 
invention was made to a person having ordinary skill in the art to which said subject matter pertains. 
Patentability shall not be negatived by the manner in which the invention was made. 

2. Claims 1 to 3, 5 to 7, 9 to 11, and 13 to 15 are rejected under 35 U.S.C. 103(a) 
as being unpatentable over Morris in view of Thambiratnam et al. ("Speech Recognition 
in Adverse Environments using Lip Information"). 

Concerning independent claims 1, 5, and 9, Morris discloses a speech 
recognition method, device, and system, comprising: 

"an audio signal receiver configured to receive audio signals from a speech 
source" - a user speaks to system 100, and system 100 captures the user's speech 
with speech input unit 104 (column 4, lines 15 to 19: Figures 1 and 2: Block 202); 
speech is an audio signal; 

"a video signal receiver configured to receive video signals from the speech 
source" - a user speaks to system 100, and system 100 captures the user's image with 
video input unit 102 (column 4, lines 15 to 19: Figures 1 and 2: Block 202); 

"a processing unit configured to process the audio signals and the video signals" 
- system 100 combines any captured speech or video and proceeds to process the 
combined data stream in multi-sensor fusion/recognition unit 106 (column 4, lines 20 to 
24: Figures 1 and 2: Block 204); 

"a conversion unit configured to convert at least one of the audio signals and the 
video signals to recognizable information" - system 100 interprets any verbal input 
using the speech recognition functions of multi-sensor fusion/recognition unit 106; 
speech recognition is supplemented by visual information captured by video input unit 
102, such as any interpreted facial expressions (e.g., lip-reading); a list of spoken words 
is generated from the verbal input (column 4, lines 25 to 31: Figures 1 and 2: Block 
206); spoken words are recognizable information; 

"an implementation unit configured to implement a task based on the 
recognizable information" - system 100 provides a response based upon whether the 
user has asked a question or made a statement; if a user has asked a question, then 
system 100 searches knowledge database 116 for a response to the objective question; 
a user may ask: "What is the weather in Phoenix, today?"; system 100 retrieves an 
answer, and the information is communicated as output via computer monitor and 
speakers (column 4, line 56 to column 5, line 24: Figure 3: Blocks 306, 308, 310, 312, 
322); responding to a question by searching a knowledge database for a weather report 
in Phoenix, and outputting the weather report, is equivalent to implementing a task. 

Concerning independent claims 1, 5, and 9, the only elements arguably omitted 
by Morris are "detecting if the audio signal can be processed", processing the audio 
signals "if it is detected that the audio signals can be processed", and processing the 
video signals "if it is detected that at least a portion of the audio signal cannot be 
processed". Morris discloses processing both the audio and video signals for multi-sensor 
fusion, so that better recognition can be obtained from speech input and video 
input. Fundamentally, one having ordinary skill in the art would readily understand that 
a speech recognizer that utilizes both audio and video for purposes of recognition would 
utilize the video if the quality of the audio information is poor, and utilize the audio if the 
quality of the audio information is good. Specifically, Thambiratnam et al. teaches 
speech recognition in adverse environments, where asynchronous integration merges 
the results of two systems together to produce a combined probability: 

P_C = A P_A + (1 - A) P_V,

where P_A represents the acoustic score from the acoustic subsystem, P_V represents the 
visual score from the video subsystem, and A is a weighting parameter that depends 
on the signal-to-noise ratio (SNR). (§4.2 Asynchronous Integration: Pages 150 to 151: 
Figure 3) Moreover, Figure 4 illustrates performance accuracy as a function of SNR, 
where the visual subsystem performs at the same error rate of approximately 85%, 
regardless of the SNR, whereas the acoustic subsystem's performance degrades rapidly 
as the SNR decreases. In fact, Figure 4 shows that for SNR < 5, a video subsystem will 
provide better accuracy than any of the acoustic subsystems of Mel-Cepstral, RASTA, 
or Mel-RASTA. (§5.1 Individual Sub-System Performance: Page 151: Figure 4) 
(Setting the weighting parameter, A=0, corresponds to processing only the video 
signals.) Thus, one skilled in the art would have found it "obvious to try" processing the 
video signals based on a detection that at least a portion of the audio signal cannot be 
processed due to a low signal-to-noise ratio as taught by Thambiratnam et al. It would 
have been obvious to one having ordinary skill in the art to process the video signals 
based on a detection that at least a portion of the audio signal cannot be processed as 
suggested by Thambiratnam et al. in a multi-sensor fusion/recognition unit of Morris for 
a purpose of improving an accuracy of speech recognition in adverse environments for 
conditions of low signal-to-noise ratios. 
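
By way of illustration only, the weighted combination discussed above can be sketched 
in Python as follows. The function names, the linear mapping from SNR to the weight A, 
and the break points of 5 and 25 are assumptions made for this sketch; they are not 
taken from Thambiratnam et al. or from Morris, which may determine the weight differently. 

```python
def combined_score(p_acoustic: float, p_visual: float, weight: float) -> float:
    """Return P_C = A * P_A + (1 - A) * P_V for a weight A in [0, 1]."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * p_acoustic + (1.0 - weight) * p_visual


def weight_from_snr(snr: float, low: float = 5.0, high: float = 25.0) -> float:
    """Hypothetical mapping from SNR to the weight A.

    Assumption for illustration: A is 0 at or below `low` (video only),
    1 at or above `high` (audio only), and rises linearly in between.
    """
    if snr <= low:
        return 0.0
    if snr >= high:
        return 1.0
    return (snr - low) / (high - low)


# At low SNR the weight is 0, so only the visual score contributes.
low_snr_score = combined_score(0.2, 0.8, weight_from_snr(0.0))

# At high SNR the weight is 1, so only the acoustic score contributes.
high_snr_score = combined_score(0.9, 0.8, weight_from_snr(30.0))
```

In this sketch, a weight of A = 0 reduces the combined score to the visual score alone, 
which corresponds to processing only the video signals. 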

Concerning independent claims 13 to 15, similar considerations apply as to 
independent claims 1, 5, and 9. Implicitly, the signal-to-noise ratio must be a function of 
time, and the audio and video segments coincide in time, so that Thambiratnam et al. 
would process the audio and video as segments coinciding in time. 

Concerning claims 2, 6, and 10, Morris discloses that video input unit 102 
receives face/voice expressions and interpreted facial expressions including lip-reading 
(column 4, lines 27 to 30: Figures 1 and 2). 

Concerning claims 3, 7, and 11, Morris discloses that, in one embodiment, 
processing by multi-sensor fusion/recognition unit 106 is split into three parallel 
processes to minimize time of processing (column 4, lines 20 to 24: Figures 1 and 2). 

3. Claims 4, 8, and 12 are rejected under 35 U.S.C. 103(a) as being unpatentable 
over Morris in view of Thambiratnam et al. ("Speech Recognition in Adverse 
Environments using Lip Information") as applied to claims 1, 5, and 9 above, and further 
in view of Bakis et al. 

Morris does not expressly disclose a storage unit for storing the audio signals 
and the video signals to a destination source, and a transmitter for sending the audio 
signals and the video signals to a destination source. However, it is well known to 
operate biometric identification via a client/server network, where biometric data is 
stored on a server, and biometric data is collected locally but compared to stored 
biometric data on the server. Bakis et al. teaches an analogous art method and 
apparatus for recognizing the identity of individuals by a speaker recognition system 
and a lip classifier, where biometric attributes are pre-stored for later retrieval so that 
they may be compared. Further, a server is included for interfacing with a plurality of 
biometric recognition systems to receive requests for biometric attributes therefrom and 
transmit biometric attributes thereto. The server has a memory device for storing the 
biometric attributes. (Column 8, Line 47 to Column 9, Line 16) Objectives are to 
provide a significant increase in the degree of accuracy of recognition and to provide a 
significant reduction in fraudulent or errant access to a service and/or facility. It would 
have been obvious to one having ordinary skill in the art to store and send biometric 
attributes to a server ("a destination source") as taught by Bakis et al. in a method, 
device, and system for combining audio and video signals of Morris for purposes of 
increasing accuracy of recognition and reducing fraudulent access. 

4. Claims 19 to 21 are rejected under 35 U.S.C. 103(a) as being unpatentable over 
Morris in view of Thambiratnam et al. ("Speech Recognition in Adverse Environments 
using Lip Information") as applied to claims 1, 5, and 9 above, and further in view of 
Brunelli et al. 

Morris omits "determining if the video images of the user are detected", and 
"indicating to the user if the video image is not detected." However, one having ordinary 
skill in the art would understand that if the camera does not properly capture a face of a 
speaker in a method and apparatus for audio-visual speech recognition, then the 
camera would need to be adjusted. Specifically, Brunelli et al. teaches an integrated 
multisensory recognition system for speaker-recognition and visual-features recognition 
(Abstract), where an attention module 9 is sensitive to a signal provided by a television 
camera 3. When attention module 9 detects a face due to the arrival of a person P in 
front of television camera 3, a snapping module 10 waits until a scene in front of 
television camera 3 has stabilized, and checks that certain elementary conditions are 
satisfied. When snapping module 10 has verified the existence of conditions of stability 
of a framed image, an acoustic indicator or loud speaker asks person P to utter certain 
words to initiate multisensory recognition. (Column 4, Line 50 to Column 5, Line 34: 
Figure 2) Thus, person P, or "the user", is notified when his/her images are not 
detected because an acoustic indicator does not prompt the user to speak the words; a 
user only hears an audio indication when his/her image is captured, so an absence of a 
prompt is equivalent to an indication that the video image was not detected. An 
objective is to combine acoustic and visual data in an optimal manner that reduces 
probabilities of error to a minimum. (Column 2, Lines 3 to 10) It would have been 
obvious to one having ordinary skill in the art to provide a feature of notifying a user if a 
video image is not detected as taught by Brunelli et al. in a method and apparatus of 
multi-sensor fusion/recognition of Morris for a purpose of combining acoustic and visual 
data in an optimal manner that reduces probabilities of error to a minimum. 

Allowable Subject Matter 

5. Claims 16 to 18 are objected to as being dependent upon a rejected base claim, 
but would be allowable if rewritten in independent form including all of the limitations of 
the base claim and any intervening claims. 

6. The following is a statement of reasons for the indication of allowable subject 
matter: 

The prior art of record does not disclose or reasonably suggest the limitation of 
defining an error threshold, comparing a number of detected errors in an audio signal 
with the threshold, and determining that the audio signals cannot be processed if the 
number of errors equals or exceeds the threshold. The prior art of record suggests a 
parameter involving a signal-to-noise ratio for an audio signal to determine that the 
quality of the audio signal is sufficient to obtain good accuracy for speech recognition, 
but does not compare a number of errors with a threshold. 
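
For illustration only, the distinction drawn above can be sketched in Python as follows. 
The function and parameter names are hypothetical, and the manner in which errors in 
the audio signal are detected and counted is an assumption left outside this sketch; it 
is not taken from the claims or from the prior art of record. 

```python
def audio_cannot_be_processed(detected_errors: int, error_threshold: int) -> bool:
    """Return True when the error count equals or exceeds the defined threshold."""
    if detected_errors < 0 or error_threshold < 0:
        raise ValueError("counts must be non-negative")
    return detected_errors >= error_threshold


def select_signal(detected_errors: int, error_threshold: int) -> str:
    """Hypothetical selector: process video only when audio cannot be processed."""
    if audio_cannot_be_processed(detected_errors, error_threshold):
        return "video"
    return "audio"
```

The comparison here is between a count of detected errors and a threshold, in contrast 
to a weighting parameter derived from a signal-to-noise ratio. 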

Response to Arguments 

7. Applicant's arguments filed 21 December 2007 have been considered but are 
moot in view of the new grounds of rejection. 

Conclusion 

8. The prior art made of record and not relied upon is considered pertinent to 
Applicant's disclosure. 

Connell et al., Hershey et al., Teissier et al. ("Comparing Models for Audiovisual 
Fusion in Noisy-Vowel Recognition Task"), and Lucey et al. ("Improved Speech 
Recognition using Adaptive Audio-Visual Fusion via a Stochastic Secondary Classifier") 
disclose related art directed to audio-visual speech recognition. 

Any inquiry concerning this communication or earlier communications from the 
examiner should be directed to Martin Lerner, whose telephone number is (571) 272-7608. 
The examiner can normally be reached from 8:30 AM to 6:00 PM, Monday to 
Thursday. 

If attempts to reach the examiner by telephone are unsuccessful, the examiner's 
supervisor, David R. Hudspeth, can be reached at (571) 272-7843. The fax phone 
number for the organization where this application or proceeding is assigned is 
571-273-8300. 

Information regarding the status of an application may be obtained from the 
Patent Application Information Retrieval (PAIR) system. Status information for 
published applications may be obtained from either Private PAIR or Public PAIR. 
Status information for unpublished applications is available through Private PAIR only. 
For more information about the PAIR system, see http://pair-direct.uspto.gov. Should 
you have questions on access to the Private PAIR system, contact the Electronic 
Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a 
USPTO Customer Service Representative or access to the automated information 
system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. 

ML 

2/20/08 




Examiner 

Group Art Unit 2626 



